enhancements to machine learning procedures. Researchers have proposed
several methods for evaluating and learning from biological data. In this study, a genetic algorithm
(GA) is used as a feature selection process to extract relevant
information from the RNA-Seq Anopheles gambiae mosquito malaria vector
dataset, and the results are evaluated using the k-nearest neighbor (KNN) and
decision tree classification algorithms. The experimental results obtained a

1. INTRODUCTION

Next-generation high-throughput sequencing technology has produced abundant, wide-ranging datasets.
This enormous volume of data helps biologists analyze challenging gene transcripts, such as those related to
diseases, including infections (malaria), cancer, and inherited disorders, as well as genetics and physiology, among others [1].
Blood-sucking mosquitoes of the genus Anopheles, which are vectors of the malaria parasite Plasmodium falciparum, are
found in Africa. The malaria parasite they transmit is deadly and is responsible for the deaths of thousands of
people daily. As resistance to antimalarial drugs spreads and the use of state-of-the-art antimalarial treatments increases,
the search for novel medications requires improved biological study of this infection. The parasite
regulates its gene expression very precisely, which calls for more thorough
predictive models of vector transcription [2].

Revealing genetic investigations have been carried out in ribonucleic acid sequencing (RNA-Seq)
studies by following careful, purposeful biological strategies that improve learning. High-dimensional RNA-Seq data
must first be cleaned of problems such as noise, errors, repetition, and irrelevant, inactive, or
unsuitable data, among others [3]. New capabilities strengthen the development of innovative
healthcare frameworks such as effective public health monitoring systems, advanced interventions, and the medical
diagnosis of disorders [4].

Machine learning methods have proven distinctly effective for investigating the
enormous amount of recent RNA-Seq data by studying its underlying structure [5].
Researchers have applied machine learning algorithms to RNA-Seq gene expression data with notable
success [6-8]. In this study, a genetic algorithm (GA) is proposed as a pre-processor to reduce the dimensionality of
the data, together with k-nearest neighbor (KNN) and decision tree classifiers to classify distinct genetic
structures, yielding a system suitable for predicting and detecting novel genes for malaria
infection in humans.


2. REVIEWS 

Computational procedures are based on large samples of genes from individuals with or without
diseases, in which mutations may be found to be responsible for the presence of disease. Differentially expressed genes (DEGs)
are identified through a number of methods. Machine learning methods are important for detecting variation
between genes in the human genome, and they have been applied repeatedly to
investigate and classify gene expression profiles of various diseases. Various machine learning
approaches have been reported and reviewed, using recent trends in their evaluation [4].

Machine learning was used to predict autism spectrum disorder and classify transcripts,
using RNA data from the Gene Expression Omnibus. The study ranked genes by cluster analysis and
discriminated between them using SVM and KNN classifiers, achieving an estimated accuracy of 94% [9]. Clustering and
classification of RNA-Seq data were examined through a joint evaluation that highlighted the
strengths and techniques of recently prevalent methods, using nonlinear and linear
dimension reduction on combined scRNA-seq data [10]. Gene sets from large ensembles of RNA-Seq genes
were ranked with a supervised learning approach based on random forest classification;
on 1210 tumor RNA-Seq samples, the analysis showed the need for supervised selection
approaches [11]. A supervised single-cell RNA-Seq data classification model was proposed using a
comprehensive approach that combines independent feature selection approaches; scPred
showed high accuracy on RNA-seq datasets [12]. A machine learning analysis of RNA and DNA data was proposed to identify a small set of
expressed genes that influence PAH; a feature selection algorithm was used to classify relevant genes, with
an outcome that reveals genes unique to PAH [13].

A CNN classification procedure based on a deep learning approach was developed for stomach tumor
gene expression data; 60,000 data items made up of stomach tumor genes were evaluated using PCA, heatmaps,
and CNN algorithms, with reported accuracies of 96% and 51% [14]. An approach for uncovering hidden RNA-Seq transcripts in malaria parasites
was proposed that applies several procedures to deconvolute transcriptional differences between individual
mosquitoes, revealing hidden distinct transcriptional signatures [15].

An ensemble classification algorithm for cancer data was developed using decision trees and ensembles of
decision trees on available cancer microarray data; the ensemble results improved on those of single decision tree
classification [16]. A cancer gene expression ensemble classification method was
proposed using a hybrid RFE-AdaBoost algorithm to select significant features and enhance classification
performance [17]. Cancer data were classified with an effective ensemble classification
method that improved classification, and the results were less contingent [18]. A review of metaheuristic systems for
selecting relevant genes from RNA/DNA data for classification summarized recent developments in
metaheuristic-based embedded feature selection methods, in which the ranking
coefficients of an SVM classifier provide useful information to the search operators [19]. A GA with a state-of-the-art
filter-wrapper feature selection approach was proposed and applied to five biological datasets; the results showed a substantial reduction in the
number of features used for classification [20]. An enhanced ensemble-based feature selection approach was proposed that
learns random trees on feature subsets; the method removes
unsuitable features and picks the best ones by means of a probability weighting value, with classification
evaluated using RF, SVM, and NB [21]. Several feature extraction algorithms for gene expression
analysis, such as PCA, ICA, PLS, and LLE, were reviewed and discussed with respect to
machine learning applications [22].


3. MATERIALS AND METHODS 

Several methods for handling high-dimensional data exist. This paper performs feature
selection using a GA and applies an ensemble classification algorithm to extract relevant information from
high-dimensional data and classify it. The data are RNA-Seq mosquito gene data from western Kenya, with 2457 instances
and 7 gene attributes [23], as shown in Table 1. The experiment is carried out in the MATLAB environment: the GA selects a
relevant subset of features from the dataset, and an ensemble algorithm approach is used as a
classifier on the selected features [24].

Table 1. Features of the data

Dataset                       Attributes   Instances
Mosquito Anopheles Gambiae    7            2457


3.1. GA 

GA is a wrapper-based feature selection approach that searches for suitable features in a given
high-dimensional dataset. It has several parameters, and its mutation and crossover operators
act on binary-encoded feature vectors [19]. A set of M RNA features is encoded as a binary string in which
each position denotes one feature, with the values 1 and 0 marking selected and unselected features respectively.
GA helps to find feature subsets with a chosen number of features for complex classification tasks. The GA
structure adopted here is defined in Algorithm 1 [20]:


Algorithm 1. Genetic Algorithm
Require: Set parameters nPop = m, tmax, t = 0;
Ensure: Optimal feature subset with the maximum fitness value.

1: while (t <= tmax) do
2:    Create pop m, Umax
3:    For k = 1 to m do
4:       Parents [m1, m2] = system selection (m, nPop)
5:       Child = Xor [m1, m2]
6:       Mu = mutation [Child]
7:    End for
8:    Replace m with Child1, Child2, ..., Childm
9:    t = t + 1
10: End while
11: Store the highest fitness value;


Here m = population size, r = random number between 0 and 1, chrome = selected or unselected feature according to a threshold
(set value = 0.5), and a = threshold number of selected features. Selecting the fittest features from the
given dataset is the main problem addressed by the GA technique.
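
The loop of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in this study. The tournament selection, the uniform crossover (swapped in for the Xor step), the bit-flip mutation rate, and the names `run_ga`, `toy_fitness`, and `INFORMATIVE` are all assumptions made for illustration; in practice the fitness would be the accuracy of a classifier trained on the selected features.

```python
import random

# Hypothetical toy problem: features 0, 3 and 7 are informative, the rest are noise.
INFORMATIVE = (0, 3, 7)

def toy_fitness(mask):
    """Reward informative features and charge a small cost per selected feature."""
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)

def run_ga(n_features, fitness, n_pop=20, t_max=30, p_mut=0.02, seed=0):
    """Search for a binary feature mask (1 = selected, 0 = unselected)."""
    rng = random.Random(seed)
    # Initial population: a gene is switched on when a random draw r exceeds 0.5.
    pop = [[1 if rng.random() > 0.5 else 0 for _ in range(n_features)]
           for _ in range(n_pop)]
    best = max(pop, key=fitness)
    for _ in range(t_max):
        children = []
        for _ in range(n_pop):
            # Tournament selection of two parents (assumed selection scheme).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover: each gene is copied from one parent at random.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Bit-flip mutation with probability p_mut per gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        # Replace the whole population with the children.
        pop = children
        # Keep the fittest mask seen so far.
        best = max(pop + [best], key=fitness)
    return best

best = run_ga(10, toy_fitness)
print(best, toy_fitness(best))
```

Because the best mask found so far is stored separately, its fitness never decreases from one generation to the next, even though the population itself is fully replaced.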


3.2. KNN 

K-nearest neighbor (KNN) is a supervised learning classification technique that, applied to gene datasets,
classifies a new instance according to its neighborhood. The KNN algorithm classifies
a new object based on its attributes and on the training samples. KNN classifiers do not
fit a model during training but are memory-based. The selected features are taken as input. The K
nearest neighbors are the training samples closest to the query point: the distances between the query instance and the training
samples are computed and sorted, and the samples up to the K-th smallest distance are kept. The classes Y of these nearest
neighbors are collected, and the simple majority among them is used as the prediction for the
query instance. Ties can be broken randomly [25].
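
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the MATLAB classifier used in the experiments; the function name `knn_predict`, the Euclidean distance, and the toy two-attribute samples are assumptions.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, rng=None):
    """Classify `query` by majority vote among its k nearest training samples."""
    rng = rng or random.Random(0)
    # Distance from the query instance to every training sample.
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k samples with the smallest distance.
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    # Ties between classes are broken at random.
    tied = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(tied)

# Hypothetical toy data with two attributes per sample.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_y = ["susceptible", "susceptible", "resistant", "resistant"]
print(knn_predict(train_X, train_y, (0.2, 0.1), k=3))
```

No model is fitted here: the training samples are simply stored and searched at prediction time, which is the memory-based behavior described above.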


3.3. Decision trees 

The decision tree classification algorithm recursively divides the instance space with hyperplanes
orthogonal to the attribute axes. A decision tree model consists of internal nodes representing attributes; at each node the instance space
is split according to the values of an attribute chosen by the algorithm. Each resulting sub-space is divided
iteratively until a stopping criterion is met, and the terminal nodes (leaf nodes) are assigned the class labels
that define the classification. Choosing an appropriate stopping criterion is important: trees that are too large
overfit, while trees that are too small underfit and lose accuracy. Algorithms include strategies against
overfitting, known as pruning. A new instance is classified by following the tree from the root down to a leaf, according to
the outcomes of the tests along the path [26]. Decision tree classifiers are also well suited to
ensembles, because small variations in the training data can produce completely different models.
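
The recursive partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MATLAB decision tree used in this study: the Gini impurity criterion, the depth limit used as the stopping criterion in place of pruning, and the names `gini`, `build_tree`, and `predict` are all choices made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split on one attribute at a time (axis-aligned splits)."""
    # Stopping criterion: pure node or depth limit reached, so return a leaf label.
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            # Weighted impurity of the two sub-spaces produced by this split.
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # no split separates the samples
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return {
        "attr": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the tests from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical toy data: the first attribute separates the two classes.
X = [(0.1, 5.0), (0.2, 4.0), (0.8, 5.5), (0.9, 4.5)]
y = ["susceptible", "susceptible", "resistant", "resistant"]
tree = build_tree(X, y)
print(predict(tree, (0.15, 9.0)), predict(tree, (0.7, 0.0)))
```

A smaller `max_depth` gives a smaller tree that may underfit, while a larger value lets the tree grow until every leaf is pure, which is where overfitting becomes a risk.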


3.4. Performance evaluation and applications 

A machine learning model needs its performance evaluated and validated with metrics derived from a confusion
matrix and its formulas [4, 27]. Gene expression analysis supports improved pathway identification in RNA-Seq data,
helping to find useful genes for applications such as treatment modification, disease diagnosis,
drug and gene discovery, and the classification of cancers, typhoid, malaria, and other ailments. The discovery of patterns and
inconsistencies in data by machine learning has produced powerful algorithms applicable to many
fields such as engineering, banking, and the health sector, among others. MATLAB 2015A is used as the
experimental and execution tool for the prognosis of malaria infections, on an iCore2 processor with 4GB of RAM
and a 64-bit system.


4. RESULTS AND ANALYSIS

In this study, the 2457-instance RNA-Seq dataset “Mosquitoes Anopheles Gambiae”, containing
resistant and susceptible genes, is passed to a GA to obtain an optimal reduced subset of the data,
removing uncorrelated attributes and keeping the features with maximum variance. The result provides informative genes
suitable for study with the KNN and decision tree classification algorithms; the model experiment is run in the
MATLAB environment. The genetic algorithm uses a threshold of 0.5 and yields an optimal subset of 708 features of
significant genes. The KNN and decision tree classifiers were evaluated using 10-fold cross-validation,
with a holdout parameter of 0.05 for the training data, and classifier
accuracy was tested on 25% of the data. Training and testing with
10-fold cross-validation removes sampling bias. The performance metrics and computation time
are evaluated [27], and the classification performance of the models is compared: KNN (bagging) and decision
tree reach 88.3% and 98.3% accuracy respectively, based on the confusion matrices. The loaded data are shown in
Figure 1. The relevant components selected by the GA from the full data
are passed to KNN and to the decision tree, and the resulting confusion matrices, shown in Figure 2 and Figure 3, are used to
derive the performance metrics. The KNN classification algorithm achieves an accuracy of 88.3%,
while the decision tree classification algorithm achieves an accuracy of 98.3%; the other performance metrics
are shown in Table 2.
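
The 10-fold cross-validation scheme mentioned above can be sketched as follows. This is an assumed, generic formulation rather than the MATLAB routine used in the experiment; the names `k_fold_indices` and `cross_validate` and the `evaluate` callback are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# With 2457 instances, each of the 10 folds holds 245 or 246 test samples.
splits = list(k_fold_indices(2457, k=10))
print([len(test) for _, test in splits])
```

Because each sample appears in exactly one test fold, the averaged score does not depend on a single, possibly unrepresentative, train/test split.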

Figure 1. Mosquito Anopheles gambiae dataset loaded in the MATLAB environment


In this study, an RNA-Seq Anopheles gambiae mosquito dataset [28] is used to test the performance of the machine
learning method. The genetic algorithm dimensionality reduction model selects a subset of 708 features
from the 2457 gene features in the data. The selected components were classified using the KNN and decision tree
algorithms, and their performance was evaluated. The results confirm the effectiveness of the machine learning approach
on gene data; the outcomes are compared in Table 2, which shows that
GA-decision tree outperforms GA-KNN in terms of accuracy. In this study, an improved classification of malaria
vector data is obtained using GA with the decision tree and KNN algorithms respectively. Numerous works have
been reviewed, and the results indicate that GA enhances classification performance for KNN and decision tree.

Figure 2. RNA-Seq confusion matrix (true class versus predicted class) using decision tree algorithm TP=38; TN=21; FP=0; FN=1


Figure 3. RNA-Seq confusion matrix (true class versus predicted class) using KNN TP=36; TN=17; FP=4; FN=3


Table 2. Performance metrics table for the confusion matrix

Performance Metrics   GA-Decision Tree Classification   GA-KNN Classification
Accuracy (%)          98.3                              88.3
Sensitivity (%)       97.4                              92.3
Specificity (%)       100                               81.0
Precision (%)         100                               90.0
Recall (%)            97.4                              92.3
F-Score (%)           98.7                              91.1
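
The entries of Table 2 follow from the confusion matrix counts reported in Figure 2 and Figure 3. The short Python sketch below applies the standard confusion matrix formulas to those counts; the function name `confusion_metrics` is an assumption made for illustration, and recall is listed separately only because Table 2 does so (it is the same quantity as sensitivity).

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the Table 2 metrics, in percent, from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # also called recall or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)

    def pct(x):
        return round(100.0 * x, 1)

    return {
        "accuracy": pct(accuracy),
        "sensitivity": pct(sensitivity),
        "specificity": pct(specificity),
        "precision": pct(precision),
        "recall": pct(sensitivity),
        "f_score": pct(f_score),
    }

# Counts taken from Figure 2 (decision tree) and Figure 3 (KNN).
ga_tree = confusion_metrics(tp=38, tn=21, fp=0, fn=1)
ga_knn = confusion_metrics(tp=36, tn=17, fp=4, fn=3)
print(ga_tree)
print(ga_knn)
```

Both confusion matrices sum to 60 test instances, so the accuracies of 98.3% and 88.3% correspond to 59 and 53 correct predictions respectively.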


5. CONCLUSION 

In this study, improvements in the efficiency of predicting and detecting malaria infection in humans are
proposed using machine learning dimensionality reduction and classification techniques. GA feature selection
for dimensionality reduction and KNN and decision tree classifiers were employed, and the performance results
obtained were evaluated and analysed. This study enhanced malaria vector data classification and
was compared with a number of works proposed in the literature by other researchers. The outcomes demonstrate
that the GA dimensionality reduction model helps to improve the output of classifiers such as the decision tree.
Future work can investigate current approaches proposed in the literature to improve the feature selection models and algorithms, and
compare them with other recent state-of-the-art classification algorithms.